Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.
Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Machine learning is closely related to, and often overlaps with, computational statistics, a discipline that also focuses on prediction-making through the use of computers.
Data science is an interdisciplinary field about processes and systems for extracting knowledge or insights from data in various forms, structured or unstructured. It is a continuation of data analysis fields such as statistics, data mining, and predictive analytics.
Statistical learning (or statistical machine learning) is largely about using statistical modeling ideas to solve machine learning problems.
“Learning” here means using data to build or fit models.
From An Introduction to Statistical Learning:
“Statistical learning refers to a vast set of tools for understanding data.”
“Though the term statistical learning is fairly new, many of the concepts that underlie the field were developed long ago.”
“Inspired by the advent of machine learning and other disciplines, statistical learning has emerged as a new subfield in statistics, focused on supervised and unsupervised modeling and prediction.”
Suppose we observe data \((x_{11}, x_{21}, \ldots, x_{d1}, y_1), \ldots\), \((x_{1n}, x_{2n}, \ldots, x_{dn}, y_n)\). We have a response variable \(y_i\) and \(d\) explanatory variables \((x_{1i}, x_{2i}, \ldots, x_{di})\) per unit of observation.
Ordinary least squares models the variation of \(y\) in terms of \(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_d x_d\).
The assumed model is
\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_d X_{di} + E_i\]
where \({\rm E}[E_i] = 0\), \({\rm Var}(E_i) = \sigma^2\), and \(\rho_{E_i, E_j} = 0\) for all \(1 \leq i, j \leq n\) and \(i \not= j\).
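As a minimal sketch of this model in R, we can simulate data satisfying the assumptions above and fit it with `lm()` (the coefficient values and variable names here are hypothetical):

```r
# Simulate Y = beta_0 + beta_1*X_1 + beta_2*X_2 + E with E[E] = 0, Var(E) = sigma^2
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
e <- rnorm(n, mean = 0, sd = 0.5)   # the error term E_i
y <- 1 + 2 * x1 - 1 * x2 + e        # true beta_0 = 1, beta_1 = 2, beta_2 = -1

fit <- lm(y ~ x1 + x2)              # ordinary least squares
coef(fit)                           # estimates of beta_0, beta_1, beta_2
```

The estimated coefficients should land close to the true values used in the simulation.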
Let’s collapse \(X_i = (X_{1i}, X_{2i}, \ldots, X_{di})\). A more general model is
\[Y_i = f(X_i) + E_i,\]
with the same assumptions on \(E_i\), for some function \(f\) that maps the \(d\) variables into the real numbers.
Figure credit: ISL
Input variables \((X_{1}, X_{2}, \ldots, X_{d})\) are also called predictors, features, or independent variables.
The output variable \((Y)\) is also called the response or dependent variable.
Supervised learning is aimed at fitting models to \((X,Y)\) so that we can model the output \(Y\) given the input \(X\), typically on future observations. Prediction models are built by supervised learning.
Unsupervised learning (next week’s topic) is aimed at fitting models to \(X\) alone to characterize the distribution of or find patterns in \(X\).
We often want to fit \(Y = f(X) + E\) for either prediction or inference.
When observed \(x\) are readily available but \(y\) is not, the goal is usually prediction. If \(\hat{f}(x)\) is the estimated model, we predict \(\hat{y} = \hat{f}(x)\) for an observed \(x\). Here, \(\hat{f}\) is often treated as a black box and we mostly care that it provides accurate predictions.
When we co-observe \(x\) and \(y\), we are often interested in understanding how \(y\) is explained by, or is a causal effect of, variation in \(x\), and we want to quantify these relationships explicitly. This is the goal of inference. Here, we want to estimate and interpret \(f\) as accurately as possible, so that it is as close as possible to the underlying real-world mechanism connecting \(x\) to \(y\).
When \(Y \in (-\infty, \infty)\), learning \(Y = f(X) + E\) is called regression.
When \(Y \in \{0,1\}\) or more generally \(Y \in \{c_1, c_2, \ldots, c_K\}\), we want to learn a function \(f(X)\) that takes values in \(\{c_1, c_2, \ldots, c_K\}\) so that \({\rm Pr}\left(Y=f(X)\right)\) is as large as possible. This is called classification.
A parametric model is a pre-specified form of \(f(X)\) whose terms can be characterized by a formula and interpreted. This usually involves parameters on which inference can be performed, such as coefficients in the OLS model.
A nonparametric model is a data-driven form of \(f(X)\) that is often very flexible and is not easily expressed or interpreted. A nonparametric model often does not include parameters on which we can do inference.
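As a sketch of the contrast, we can fit a parametric model (`lm`, linear in \(x\)) and a nonparametric model (`loess`) to the same simulated nonlinear data; the choice of \(f(x) = \sin(x)\) and all names are hypothetical:

```r
set.seed(1)
n <- 200
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)    # true f(x) = sin(x) is nonlinear

fit_lm    <- lm(y ~ x)              # parametric: an interpretable slope
fit_loess <- loess(y ~ x)           # nonparametric: flexible, no coefficients

# The flexible fit tracks sin(x) much more closely than the straight line:
mean((fitted(fit_lm)    - sin(x))^2)
mean((fitted(fit_loess) - sin(x))^2)
```

The parametric fit gives us a coefficient to interpret but misses the curvature; the nonparametric fit captures the curvature but offers no parameters on which to do inference.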
Let \(\hat{Y} = \hat{f}(X)\) be the output of the learned model. Suppose that \(\hat{f}\) and \(X\) are fixed. We can then define the error of this fitted model by:
\begin{eqnarray}
{\rm E}\left[\left(Y - \hat{Y}\right)^2\right] & = & {\rm E}\left[\left(f(X) + E - \hat{f}(X)\right)^2\right] \\
& = & {\rm E}\left[\left(f(X) - \hat{f}(X)\right)^2\right] + {\rm Var}(E)
\end{eqnarray}
The cross term \(2\,{\rm E}\left[\left(f(X) - \hat{f}(X)\right)E\right]\) vanishes because \(\hat{f}\) and \(X\) are fixed and \({\rm E}[E] = 0\). The term \({\rm E}\left[\left(f(X) - \hat{f}(X)\right)^2\right]\) is the reducible error and the term \({\rm Var}(E)\) is the irreducible error.
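A quick simulation sketch of this decomposition: at a fixed \(x_0\), with \(\hat{f}\) fixed, the average squared prediction error should split into the reducible and irreducible pieces (the choices of \(f\), \(\hat{f}\), \(x_0\), and \(\sigma\) below are hypothetical):

```r
set.seed(1)
f    <- function(x) sin(x)            # true f (a hypothetical choice)
fhat <- function(x) x - x^3 / 6       # a fixed, imperfect estimate of f
x0    <- 1
sigma <- 0.5

y <- f(x0) + rnorm(1e5, sd = sigma)   # many repeated draws of Y at X = x0

mean((y - fhat(x0))^2)                # approximates E[(Y - Yhat)^2]
(f(x0) - fhat(x0))^2 + sigma^2        # reducible + irreducible, computed directly
```

The two quantities agree up to Monte Carlo error; no matter how good \(\hat{f}\) is, the error cannot drop below \(\sigma^2\).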
On an observed data set \((x_1, y_1), \ldots, (x_n, y_n)\) we usually calculate error rates as follows.
For regression, we calculate the mean-squared error:
\[\mbox{MSE} = \frac{1}{n} \sum_{i=1}^n \left(y_i - \hat{f}(x_i)\right)^2.\]
For classification, we calculate the misclassification rate:
\[\mbox{MCR} = \frac{1}{n} \sum_{i=1}^n 1[y_i \not= \hat{f}(x_i)],\]
where \(1[\cdot]\) is 1 if the argument is true and 0 if it is false.
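As a minimal sketch, both error rates are one-liners in R (the toy values below are hypothetical):

```r
# Mean-squared error for a regression fit
y    <- c(1.0, 2.0, 3.0, 4.0)       # observed responses
yhat <- c(1.1, 1.9, 3.2, 3.8)       # predictions from fhat
mse  <- mean((y - yhat)^2)

# Misclassification rate for a classification fit
labels <- c("a", "b", "a", "a")     # observed classes
preds  <- c("a", "a", "a", "b")     # predicted classes
mcr    <- mean(labels != preds)     # fraction of labels predicted incorrectly

mse
mcr
```

Note `mean(labels != preds)` implements the indicator sum directly, since R coerces `TRUE`/`FALSE` to 1/0.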
We typically fit the model on one data set and then assess its accuracy on an independent data set.
The data set used to fit the model is called the training data set.
The data set used to test the model is called the testing data set or test data set.
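A sketch of this workflow in R, on simulated data with a hypothetical true \(f(x) = \sin(x)\): fit on a random half of the data, then compute MSE separately on the held-out half.

```r
set.seed(1)
n <- 200
dat <- data.frame(x = runif(n, -3, 3))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

train <- sample(n, n / 2)                       # indices of the training set
fit <- lm(y ~ poly(x, 3), data = dat[train, ])  # fit only on training data

train_mse <- mean((dat$y[train]  - predict(fit))^2)
test_mse  <- mean((dat$y[-train] - predict(fit, newdata = dat[-train, ]))^2)
c(train = train_mse, test = test_mse)
```

The test-set MSE is the honest estimate of prediction accuracy, since those observations played no role in fitting \(\hat{f}\).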
Why do we need training and testing data sets to accurately assess a learned model’s accuracy?
How is this approach notably different from the inference approach we learned earlier?
Overfitting is a very important concept in statistical machine learning.
It occurs when the fitted model follows the noise term too closely.
In other words, it occurs when \(\hat{f}(X)\) fits the \(E\) term in \(Y = f(X) + E\), not just \(f(X)\).
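A sketch of overfitting on simulated data (a hypothetical \(f(x) = \sin(x)\)): as the polynomial degree of \(\hat{f}\) grows, training MSE keeps shrinking, but test MSE eventually rises because \(\hat{f}\) starts chasing the noise.

```r
set.seed(1)
n <- 100
dat <- data.frame(x = runif(n, -3, 3))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)
train <- sample(n, n / 2)

# Fit polynomials of increasing degree; record training and test MSE for each
mse <- sapply(c(1, 3, 15), function(deg) {
  fit <- lm(y ~ poly(x, deg), data = dat[train, ])
  c(train = mean((dat$y[train]  - predict(fit))^2),
    test  = mean((dat$y[-train] - predict(fit, newdata = dat[-train, ]))^2))
})
colnames(mse) <- c("deg 1", "deg 3", "deg 15")
mse
```

Training MSE decreases monotonically with degree (the models are nested), while the degree-15 fit follows the noise in the training set and performs much worse on the test set than on the data it was fit to.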
Figure credit: ISL
There are several important trade-offs encountered in prediction or learning:
These are not mutually exclusive phenomena.
Figure credit: ISL
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] broom_0.4.0 dplyr_0.4.3 ggplot2_2.1.0
[4] knitr_1.12.3 magrittr_1.5 devtools_1.10.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.4 mnormt_1.5-3 munsell_0.4.3
[4] lattice_0.20-33 colorspace_1.2-6 R6_2.1.2
[7] stringr_1.0.0 plyr_1.8.3 tools_3.2.3
[10] revealjs_0.5.1 parallel_3.2.3 grid_3.2.3
[13] nlme_3.1-125 gtable_0.2.0 psych_1.5.8
[16] DBI_0.3.1 htmltools_0.3.5 yaml_2.1.13
[19] digest_0.6.9 assertthat_0.1 tidyr_0.4.1
[22] reshape2_1.4.1 formatR_1.3 memoise_1.0.0
[25] evaluate_0.8.3 rmarkdown_0.9.5.9 stringi_1.0-1
[28] scales_0.4.0